This iSoMAs package contains the proposed “iSoMAs” (iSoform
expression and somatic Mutation Association) algorithm (function
iSoMAs), an efficient computational pipeline based on
principal component analysis (PCA) techniques for exploring the
association between somatic single nucleotide variant (SNV) and gene
isoform expression through both cis- and trans-regulatory mechanisms. We
have shown that iSoMAs is more efficient and versatile than existing
methods in identification of SNV-isoform expression associations. Since
iSoMAs implements differential analysis along a small number of PC
coordinates (always ≤50) rather than along thousands of original
transcript axes directly, it dramatically outperforms the existing
models in efficiency. Furthermore, iSoMAs searches for the
SNV-associated isoform expression over the whole transcriptome
simultaneously, indicating its versatility compared to previous studies.
Most importantly, iSoMAs overcomes the limitation of the low mutation
frequencies of most cancer genes by examining the association between an
SNV and the meta-isoform expression quantified as the PC score, well
reducing the false positive rate incurred by the traditional
gene-by-gene association study.
iSoMAs integrates sample-matched RNA-seq and DNA-seq data to study
the association between gene somatic mutations and gene isoform
expression. The iSoMAs workflow consists of two steps: In the first
step, the high-dimensional gene isoform expression matrix
(d=59,866 derived from the 15,448
multi-isoform genes) for each cancer type is trimmed by mean.var.plot
method (built in the Seurat toolkit) into a more informative expression
matrix, which keeps only the most variable isoforms but is still
high-dimensional (d~3,500). After that, the Principal
Component Analysis (PCA) is performed to further reduce the dimension of
the informative expression matrix into a much lower-dimensional PC score
matrix (d<=50) by calculating a PC loading matrix. Each
column of the PC loading matrix performs a particular linear combination
of the top variable isoforms into a meta-isoform, with the combination
coefficients stored in the corresponding column of the PC loading
matrix. All the meta-isoforms comprise the coordinates of the new
low-dimensional space. In the second step, a differential PC score
analysis is conducted along each of the 50 PC-coordinates based on the
mutation status of the studied gene by Wilcoxon rank-sum test. Following
the differential PC score analysis, the significant genes (termed
candidate iSoMAs genes) are identified if the minimum of the 50 p-values
[defined as minP = min{P1,P2,...,P50}] is smaller than a
predefined threshold (i.e.,minP<1e-3). The iSoMAs genes
are eventually determined after a two-step multiple testing correction
procedure on all candidate iSoMAs genes.
In this tutorial, we guide users to execute iSoMAs and
visualize/interpret the iSoMAs results step-by-step. This tutorial was
implemented on R version 4.4.0 (2024-04-24) on macOS Sonoma 14.4.1. The
latest developmental version of iSoMAs can be downloaded from GitHub and
installed from source by
devtools::install_github('elnitskilab/iSoMAs'). After
installation, we load the package as below:
library(iSoMAs)
Four datasets are required as iSoMAs inputs: a isoform expression
matrix (data.iso), a gene somatic mutation table as a maf
(mutation annotation format) (data.maf), two mapping files
between gene names and their associated isoforms
(gene_to_iso, iso_to_gene). iSoMAs
automatically determines a qualified list of genes to test
(genes_to_test) if not designated [In this demo, we test
only the first 100 genes in the list of genes_to_test, and therefore you
may see a little difference than the manuscript]. An example of the four
datasets collected from TCGA-LUAD project have been incorporated into
this package and can be used directly. Users can also download them from
the following sites and load to the work space manually.
data.iso [data.frame: 73,599 * 576]: https://drive.google.com/file/d/1pE4L7mkuUy_Ry6-9xMP7acx4ElSLibcD/view?usp=share_link
data.maf [data.frame: 208,180 * 120]: https://drive.google.com/file/d/14inFBJhHItABaA0-1gs5z11gO7NeflY_/view?usp=share_link
gene_to_iso [list of list: 29,181 elements], iso_to_gene [list: 73,599 elements]: https://drive.google.com/file/d/1_Ua0tCoLbtgIgn3Lfemz3OfEzz2hl4Z9/view?usp=share_link’
res_isomas = iSoMAs(data.iso,data.maf,gene_to_iso,iso_to_gene,
genes_test=NULL,
stat.method="wilcox",
ntop.gene=100, minP.sorted = T,
return.Seurat = T,
filename = "res_iSoMAs_TCGA-LUAD.RData")
#> Loading required package: SeuratObject
#> Loading required package: sp
#>
#> Attaching package: 'SeuratObject'
#> The following objects are masked from 'package:base':
#>
#> intersect, t
#> dim of trimmed data.iso: 59866, 517
#> min.cells=25; min.features=5986
#> dim of iso: 53963, 517
#> Normalizing layer: counts
#> Finding variable features for layer data
#> selection.method of mean.var.plot (mvp) for FindVariableFeatures was used,
#> length of VariableFeatures(iso): 3315
#> Centering and scaling data matrix
#> PC_ 1
#> Positive: uc001tnm.2, uc001roo.2, uc002yse.1, uc002qzc.2, uc001tld.2, uc001cuj.2
#> Negative: uc002ent.2, uc010rfi.1, uc010vml.1, uc010qoh.1, uc001csp.3, uc002yhe.2
#> PC_ 2
#> Positive: uc002wwp.1, uc011crd.1, uc002afz.2, uc003gpp.2, uc004dyg.2, uc002jvg.2
#> Negative: uc001exx.2, uc002cyk.3, uc002yzj.2, uc003hus.3, uc009xcb.1, uc003axb.2
#> PC_ 3
#> Positive: uc002rao.1, uc001gfr.1, uc011kql.1, uc001qnw.2, uc001lfv.2, uc002eel.2
#> Negative: uc002okz.3, uc003eet.2, uc004bgq.2, uc010jee.2, uc001jwc.2, uc002zib.1
#> PC_ 4
#> Positive: uc003uuk.2, uc002wbu.2, uc002imv.2, uc002yeo.1, uc001col.1, uc002imq.1
#> Negative: uc002gqb.3, uc010gck.2, uc002vko.2, uc010vwi.1, uc011del.1, uc001uzz.1
#> PC_ 5
#> Positive: uc003xhm.2, uc002wev.3, uc011kyp.1, uc002zhg.2, uc003bne.1, uc001jsp.2
#> Negative: uc002ubz.2, uc001jwj.2, uc003jiy.2, uc003oyt.2, uc003eqv.1, uc002ukv.3
#> Computing nearest neighbor graph
#> Computing SNN
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#>
#> Number of nodes: 517
#> Number of edges: 20552
#>
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.6556
#> Number of communities: 3
#> Elapsed time: 0 seconds
#> Conducting MAF_PCA...
#> genes_test is NULL, generating genes_test from data.mafmut.samp.thres = 11
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> has 199580 SNP variants with varType designated
#> top 10 mutated genes:
#>
#> TTN MUC16 RYR2 CSMD3 LRP1B USH2A ZFHX4 TP53 XIRP2 FLG
#> 877 500 449 408 344 337 291 255 245 244
#> bottom 10 mutated genes:
#>
#> ZNF527 ZNF572 ZNF582 ZNF618 ZNF624 ZNF746 ZNF75D ZNF783 ZXDA ZYG11B
#> 11 11 11 11 11 11 11 11 11 11
#> truncating to top 100 genes
#> genes_test: 100---TTN, MUC16, RYR2, CSMD3, LRP1B, USH2A, ZFHX4, TP53, XIRP2, FLG, SPTA1, FAT3, PCDH15, ZNF536, RYR3, CSMD1, MUC17, ANK2, COL11A1, DST...
#> OK
The output of iSoMAs res_isomas is a 21-entry list
containing the input arguments, the PCA and differential PC score
analysis results:
str(res_isomas,max.level = 1)
#> List of 21
#> $ pca :Formal class 'DimReduc' [package "SeuratObject"] with 9 slots
#> $ pvals_all :'data.frame': 100 obs. of 53 variables:
#> $ pvals_all_sorted :'data.frame': 100 obs. of 54 variables:
#> $ genes_test : chr [1:100] "TTN" "MUC16" "RYR2" "CSMD3" ...
#> $ min.cells : num 25
#> $ min.features : num 5986
#> $ dim_data.iso : int [1:2] 59866 517
#> $ dim_iso : num [1:2] 53963 517
#> $ project : chr "TCGA-LUAD"
#> $ mut.samp.thres : num 11
#> $ varType : chr "SNP"
#> $ varClassification : NULL
#> $ group1 : chr "WildType"
#> $ group2 : chr "Mutant"
#> $ nPCs : num 50
#> $ res.cluster : num 0.5
#> $ scale.factor : num 10000
#> $ normalization.method: chr "LogNormalize"
#> $ selection.method : chr "mvp"
#> $ iso :Formal class 'Seurat' [package "SeuratObject"] with 13 slots
#> $ time_stamp : POSIXct[1:1], format: "2025-02-06 15:49:59"
The first entry pca of the iSoMAs output is a
SeuratObject generated by the Seurat package, in which
cell.embeddings refers to the PC score matrix with samples
in rows and PC coordinates in columns, and feature.loadings
refers to the PC loading matrix with features (isoforms) in rows and PC
coordinates in columns. PC coordinates are denoted as {PC_1, PC_2, …,
PC_50}.
str(res_isomas$pca,max.level = 2)
#> Formal class 'DimReduc' [package "SeuratObject"] with 9 slots
#> ..@ cell.embeddings : num [1:517, 1:50] 9.14 4.926 -0.396 10.69 -32.874 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..@ feature.loadings : num [1:3315, 1:50] 2.23e-03 6.04e-03 3.10e-04 6.64e-03 8.08e-05 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..@ feature.loadings.projected: num[0 , 0 ]
#> ..@ assay.used : chr "RNA"
#> ..@ global : logi FALSE
#> ..@ stdev : num [1:50] 18.26 14.75 12.41 10.72 9.69 ...
#> ..@ jackstraw :Formal class 'JackStrawData' [package "SeuratObject"] with 4 slots
#> ..@ misc :List of 1
#> ..@ key : chr "PC_"
print(res_isomas$pca@cell.embeddings[1:5,1:3])
#> PC_1 PC_2 PC_3
#> TCGA-55-1592-01A-01R-0946-07 9.1401974 -3.673352 6.967802
#> TCGA-67-6215-01A-11R-1755-07 4.9262818 -13.472773 9.427592
#> TCGA-05-4422-01A-01R-1206-07 -0.3959655 -7.028158 19.569700
#> TCGA-55-A493-01A-11R-A24H-07 10.6896281 18.969092 -15.307253
#> TCGA-86-8585-01A-11R-2403-07 -32.8743034 9.892457 -6.697670
print(res_isomas$pca@feature.loadings[1:5,1:3])
#> PC_1 PC_2 PC_3
#> uc003hgs.3 2.227236e-03 -0.0019821681 0.003704788
#> uc001yse.2 6.038455e-03 0.0117939759 -0.017182606
#> uc001lsz.2 3.099701e-04 -0.0058789395 0.013615087
#> uc001lvg.2 6.641828e-03 0.0051092313 0.002970942
#> uc003hgx.3 8.081053e-05 0.0003585252 0.001103470
The pvals_all_sorted lists the tested genes ranked by
minP.
nc = ncol(res_isomas$pvals_all_sorted)
print(head(res_isomas$pvals_all_sorted[,c(1:4,(nc-1):nc)]))
#> Mutant Other WildType PC_1 PC_50 minP
#> TP53 213 37 267 0.140 0.00077 6.6e-15
#> CSMD3 199 15 303 0.850 0.07800 6.1e-14
#> KRAS 139 0 378 0.075 0.44000 6.3e-11
#> TTN 248 28 241 0.580 0.54000 1.2e-10
#> ZFHX4 165 8 344 0.800 0.26000 8.0e-09
#> DNAH11 66 2 449 0.610 0.05700 4.1e-07
In the following sections, we use simple examples to show how to interpret the iSoMAs output. Since TP53 was detected as a iSoMAs gene in most cancer types (18 out of 33). We will always use TP53 as an example if applicable. First, we extract the p-value table and determine the PC-coordinates along which TP53 was tested significant in the differential PC score analysis for subsequent use:
myGene.mut = "TP53"
project = res_isomas$project
pvals_all = res_isomas$pvals_all
pvals_sig = get_pvals_sig(pvals_all,myGene = myGene.mut)
#> PC_2 PC_4 PC_5 PC_50
#> TP53 6.6e-15 4.5e-10 0.00017 0.00077
PCs_sig = colnames(pvals_sig)
For each gene tested, the minP along all 50 PC
coordinates is first obtained.Then, all tested genes are ranked based on
their minP. Only genes with minP<pval.thres
were considered in this analysis. Users can control the number of top
iSoMAs genes (idx.plot) to show.
mat.plot = plot_isomas_pvals(pvals_all,sorted = T,pval.thres = 1e-3,
showNsamp = T,idx.plot = 10,
p.log10 = T,
show_rwnms = T,lgd = T,plot.sig = T,
get.data = T,scale.by = "none",fs = 16,
title = project)
#> significant p-values: 77
#> TP53 CSMD3 KRAS TTN ZFHX4 DNAH11
#> 6.6e-15 6.1e-14 6.3e-11 1.2e-10 8.0e-09 4.1e-07
#> PC_1 PC_2 PC_49 PC_50
#> TP53(213/267, 6.6e-15) 0.140 6.6e-15 0.7100 0.00077
#> CSMD3(199/303, 6.1e-14) 0.850 6.1e-14 0.5000 0.07800
#> KRAS(139/378, 6.3e-11) 0.075 2.9e-01 0.0086 0.44000
#> TTN(248/241, 1.2e-10) 0.580 1.2e-10 0.9400 0.54000
#> ZFHX4(165/344, 8e-09) 0.800 8.0e-09 0.6500 0.26000
#> DNAH11(66/449, 4.1e-07) 0.610 4.1e-07 0.5900 0.05700
#> plot significant only
#> PC_2 PC_3 PC_21 PC_50
#> TP53(213/267, 6.6e-15) 6.6e-15 0.01100 0.30000 0.00077
#> CSMD3(199/303, 6.1e-14) 6.1e-14 0.91000 0.98000 0.07800
#> KRAS(139/378, 6.3e-11) 2.9e-01 0.00033 0.00014 0.44000
#> TTN(248/241, 1.2e-10) 1.2e-10 0.85000 0.61000 0.54000
#> ZFHX4(165/344, 8e-09) 8.0e-09 0.90000 0.25000 0.26000
#> DNAH11(66/449, 4.1e-07) 4.1e-07 0.87000 0.40000 0.05700
#> plot -log10(p-values)
Differential PC score analysis along significant PC-coordinates upon gene somatic mutation. Users can control the statistic method (Wilcoxon or t-test) and the type of p-value to show (significance level or p-value itself) and other visualization options:
plot_PCscore_along_PCs(res_isomas,data.maf,myGene.mut,PCs=PCs_sig,label.comp="p.signif",
title.legend=myGene.mut,title.plot=project,fs=12,sz=8,
legend.position = c(0.75,0.15),
xlab = NULL,method = "wilcox",
legend.direction = "vertical")
#> Loading required package: ggplot2
#> TP53 has 294 variants in total
#> TP53 has 255 SNP variants with varType designated
#>
#> Mutant Other WildType
#> 213 37 267
#> Using Mutation as id variables
We can also show the weights (in PCA: PC loading) of features (in this study: isoforms) along each designated PC-axis. In this specific example, features (isoforms) are ranked by the feature loadings (see res_isomas$pca@feature.loadings) along PC_2.
plot_feature_loadings_along_PCs(res_isomas,PCs=PCs_sig,PC.ord = 'PC_2',fs=16)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Heatmap shows scaled expression levels of the top positive and
negative isoforms of all samples ordered by their PC scores along PC_2
axis in LUAD cancer. The isoforms were ranked by the PC loading value
along PC_2 axis. In this example, we include the TP53 mutation status of
samples in the column annotation bar. Users can add new columns to
iso@meta.data to account for other sample information
(e.g., survival, drug response, tumor stage, tumor purity, etc.). This
figure clearly shows the enrichment of samples with TP53 mutation in
samples with high level of PC score, high level of some genes (e.g.,
TPX2) and low level of some other genes (e.g., SELENBP2).
# iso = run_PCA_Seurat(data.iso=trim_features_and_samples_TCGA(data.iso,gene_to_iso,iso_to_gene),res_isomas=res_isomas)
Mutations = get_maf_of_interest_TCGA(data.maf,myGene=myGene.mut,mySamples=rownames(res_isomas$iso@meta.data))
#> TP53 has 294 variants in total
#> TP53 has 255 SNP variants with varType designated
#>
#> Mutant Other WildType
#> 213 37 267
res_isomas$iso@meta.data$Mutation = Mutations$Mutation
data.heat = plot_isomas_heatmap(res_isomas,dims = PCs_sig[1], nfeat.pos = 6, nfeat.neg = 6,
anno_row = T, anno_col = T, anno_col_keys = c("Mutation"),
anno_legend = T, get.data = F,fs = 16,
show_rwnms = T,plot.heat=T,
iso_to_gene = iso_to_gene, show_gname = T,
title.plot=paste0(project,", ", myGene.mut))
The function plot_correlation_PCscore_isoExpr generates
a volcano-style figure showing the Spearman correlation between the PC
score along a particular PC-axis and the expression level of the top
isoforms associated with a iSoMAs gene. The x and y axes represent the
Spearman correlation coefficient () and the
-log10(p-value) respectively. The significant correlation
between PC score and isoform expression level is critical for ensuring
that those top isoforms are also significantly associated with the
iSoMAs gene mutation status as the PC score does (with high
probability).
plot_correlation_PCscore_isoExpr(res_isomas,nfeat.pos = NULL, nfeat.neg = NULL, nfeat.pos.mark=50, nfeat.neg.mark=50, PCs=PCs_sig[1])
iSoMAs identified the somatic mutations associated gene expression
alternations in an indirect way. Specifically, considering the linear
combination property in PCA, we assume that the top isoforms (those with
largest absolute PC loading values) contribute most to the significant
association between the mutations status of a gene and the level of PC
score. In this sense, the top isoforms are more likely to be
significantly associated with the mutation status of this gene compared
to the bottom ones. The function
plot_iso_expr_upon_gene_mutation checks the differential
expression (at the isoform-level) upon gene mutation directly, which is
a straightfoward verification of the iSoMAs results.
data.iso.mut = plot_iso_expr_upon_gene_mutation(
data.iso,Mutations,gene_to_iso, myGene.mut=myGene.mut,
myGene.iso='TPX2', isoIDs = NULL, rmNorm = T,
plot.percent = F,Log2 = T,normData = F,
label = "p.signif", stat.method = "wilcox",
legend.position = 'right',plot.gene = F,
angle.p = 0,label.y.npc = 1,
legend.direction = "vertical",fs = 16,sz=10,
get.data=T, title=project)
#> after removing normals: 517, 2
#> 'data.frame': 960 obs. of 3 variables:
#> $ Mutation: Ord.factor w/ 2 levels "WildType(267)"<..: 2 1 1 2 2 1 1 2 1 1 ...
#> $ Isoform : Factor w/ 2 levels "uc002wwp.1","uc010gdv.1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Log2FPKM: num 10.45 9.54 8.84 12.42 11.19 ...
#> NULL
In this section, we regenerate the PC score with the top transcripts
(gene expression at isoform-level). Users can designate the number of
top isoforms used to approximate the PC score along a specific PC-axis.
We will see that using all features (in this example, 3315
variable isoforms) completely regenerates the PC score.
res_PCapprox = do_PCscore_approximation(res_isomas,myGene.mut = myGene.mut,
add.method = 'accumulatively',
sign.PC='both',verbose = 0)
#> PC_2 PC_4 PC_5 PC_50
#> TP53 6.6e-15 4.5e-10 0.00017 0.00077
#> scale.data: 53963, 517
#> scale.data.loadings: 517, 3315
#> feature.loadings: 3315, 50
plist = plot_PCscore_approximation(res_PCapprox)
#> Loading required package: ggpp
#> Registered S3 methods overwritten by 'ggpp':
#> method from
#> heightDetails.titleGrob ggplot2
#> widthDetails.titleGrob ggplot2
#>
#> Attaching package: 'ggpp'
#> The following objects are masked from 'package:ggpubr':
#>
#> as_npc, as_npcx, as_npcy
#> The following object is masked from 'package:ggplot2':
#>
#> annotate
#> Registered S3 method overwritten by 'ggpmisc':
#> method from
#> as.character.polynomial polynom
ggpubr::ggarrange(plotlist = plist[1:6],ncol = 3,nrow = 2)
Hua Tan, Valer Gotea, Sushil K. Jaiswal, Nancy E. Seidel, David O. Holland, Kevin Fedkenheuer, Abdel G. Elkahloun, Sara R. Bang-Christensen, Laura Elnitski. iSoMAs: Finding isoform expression and somatic mutation associations in human cancers. PLOS Computational Biology 2025, accepted.